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Abstract 

Models of codon evolution have attracted particular interest because of their unique capabilities to detect selection forces 
and their high fit when applied to sequence evolution. We described here a novel approach for modeling codon evolution, 
which is based on Kronecker product of matrices. The 61 x 61 codon substitution rate matrix is created using Kronecker 
product of three 4x4 nucleotide substitution matrices, the equilibrium frequency of codons, and the selection rate 
parameter. The entities of the nucleotide substitution matrices and selection rate are considered as parameters of the 
model, which are optimized by maximum likelihood. Our fully mechanistic model allows the instantaneous substitution 
matrix between codons to be fully estimated with only 19 parameters instead of 3,721, by using the biological inter- 
dependence existing between positions within codons. We illustrate the properties of our models using computer 
simulations and assessed its relevance by comparing the AlCc measures of our model and other models of codon 
evolution on simulations and a large range of empirical data sets. We show that our model fits most biological data 
better compared with the current codon models. Furthermore, the parameters in our model can be interpreted in a 
similar way as the exchangeability rates found in empirical codon models. 
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Introduction 

The recent advances in high-throughput sequencing tech- 
niques are providing researchers with a wealth of genome 
scale data, in particular for non model organisms, that is cre- 
ating a surge toward comparative genomics (Nielsen et al. 
2005; Anisimova and Liberies 2007; Studer et al. 2008; 
Barrett et al. 2009; Jing et al. 2010; Castoe et al. 2011; 
Dufresne and Jeffery 2011; Oh et al. 2012; Servin et al. 2012; 
Huang et al. 2013). Phylogenetic trees are a key element in this 
comparative framework, and their role has been strengthened 
by recent and new theoretical developments (Larget and 
Simon 1999; Ronquist and Huelsenbeck 2003; Felsenstein 
2004; Yang 2006; Blanquart and Lartillot 2008; Lartillot et al. 
2009; Rodrigue et al. 2010; Dib et al. 2014) that enable a better 
understanding of evolutionary processes occurring among 
species and genes (Aleshin et al. 2009; Cranston et al. 2009). 
However, the full potential of these new data sets will only be 
achieved by the developments of further statistical and math- 
ematical approaches. 

The statistical models that are currently in use to study the 
evolution of molecular data are designed to approximate, in a 
simplified form, the complex aspects of evolution (Felsenstein 
2004). This simplification allows us to understand more easily 
the key processes at play and identify the major driving forces 
behind them. Three types of models can be applied to current 
molecular sequences depending on the underlying data they 
are meant to analyze. The dimensionality of the parameter 
space for these models increases from a 4 x 4 substitution 



matrix for models of nucleotide evolution to 20x 20 for 
amino acids and, finally, to 61 x 61 codon models (stop 
codons being usually excluded). Depending on the type of 
data and application at hand, all or some of these models are 
applicable, and it is crucial to select a model that is appropri- 
ate and minimizes the potential discrepancies with the true, 
yet unknown, evolutionary process. 

The type of substitutions occurring at the nucleotide level 
is easily modeled with Markov models because of the rela- 
tively simple properties differentiating the nucleotides. 
Furthermore, they exhibit only four states, which render 
these models more tractable than amino acid or codon 
models from a computational point of view (Yang 2006). 
Their attractiveness due to lower computational complexity 
is reinforced by their wider applicability to any type of DNA 
sequences. For protein-coding DNA regions, however, treat- 
ing the 3-nt positions within codons as independent evolu- 
tionary units is an approximation that does not fully account 
for the inherent biological reality and can potentially mislead 
phylogenetic reconstructions (Shapiro et al. 2006; Bofkin and 
Goldman 2007; Seo and Kishino 2008; Christin et al. 2012). 

In contrast to nucleotide models, codons or amino acid 
models can be applied exclusively to protein-coding se- 
quences. The latter models are widely used to reconstruct 
evolutionary relationships among distantly related species be- 
cause they entail a lower risk of saturation (Anisimova and 
Kosiol 2009). However, amino acid models are not easily ame- 
nable to mechanistic statistical modeling because of the 
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physicochemical complexity of the relationship between 
amino acids. Such models are thus typically estimated from 
a priori defined empirical data sets (DayhofPet al. 1978; Jones 
et al. 1992; Whelan and Goldman 2001). The substitution 
rates between amino acids are thus not estimated directly 
from the specific data analyzed, which could lead to inaccu- 
racies during the tree reconstruction (Anisimova and Kosiol 
2009). Furthermore, statistical comparisons between amino 
acid and codon models indicate that synonymous substitu- 
tions are very informative (Salamin et al. 2005; Seo and 
Kishino 2008; Christin et al. 2012). Codon models may then 
be more appropriate than amino acid models for phyloge- 
netic inference even for highly divergent species. 

Unlike nucleotide models that have at most six substitu- 
tion rates, a fully generalized symmetric codon model would 
require to estimate 1,830 substitution rates (note that one 
substitution rate is usually fixed). It is statistically and com- 
putationally difficult to fit such a large number of parameters, 
and several approximations have been proposed to reduce 
this complexity. The implementations of current codon 
models have thus followed three main routes. 

First, mechanistic models reduce the biological complexity 
by restricting the parameter space to a small set of evolution- 
ary important parameters (Goldman and Yang 1994; Nielsen 
and Yang 1998; Pond and Muse 2005; Seoighe et al. 2007; 
Wong et al. 2006). The parameters typically include selection 
pressure and differential rates of transitions versus transver- 
sions (Yang 2006), although different rates of change between 
nucleotides in the three codon positions have been recently 
proposed (Zhou et al. 2010). Different implementations of 
these models exist (Goldman and Yang 1994; Muse and 
Gaut 1994). They have further been extended to allow vari- 
ation of selection pressure among codons and among lineages 
(Yang and Bielawski 2000; Zhang et al. 2005; Murrell et al. 
2012), which allows to explore a wide range of hypothesis 
testing. The advantage of these mechanistic models is to have 
a low number of parameters and be computationally tracta- 
ble, although they still require far more computational re- 
sources than nucleotide models (Schabauer et al. 2012). 

However, an important assumption of most codon models 
is that only a single nucleotide substitution per codon is al- 
lowed within a small interval of time, and double and triple 
substitutions within a codon are consequently ignored (Yang 
2006). This simplification might be unrealistic as multiple in- 
stantaneous substitutions have been observed in real DNA 
sequences. For instance, the best estimates of protein evolu- 
tion have nonzero instantaneous rates of change between 
amino acids whose codons differ by more than one nucleo- 
tide (Dayhoff et al. 1978; Whelan and Goldman 2001). This 
could be explained by highly frequent genomic events, such 
as codon bias or gene conversions, which can make these 
double or triple substitutions very likely (Aguileta et al. 
2004; Whelan and Goldman 2004; Drake 2007; Hershberg 
and Petrov 2008). The only fully parametric model that con- 
siders multiple instantaneous substitutions was developed by 
Whelan and Goldman (2004). It calculates substitution rate 
matrices for single-, double- and triple-nucleotide mutation 
separately using the equilibrium frequency of mutated 



nucleotides and a transition to transversion rate. The three 
matrices are then combined to calculate the general rate 
matrix of codons. Evaluation of this model on a large 
amount of coding sequences showed that this assumption 
improved the likelihood performance compared with the ex- 
isting mechanistic models (Whelan and Goldman 2004). 
Although the rates of double and triple substitutions have 
been estimated to be two to three orders of magnitude lower 
than single substitutions (Chuzhanova et al. 2003; Smith et al. 
2003; Whelan and Goldman 2004), these results further high- 
light the fact that they cannot be neglected (Whelan and 
Goldman 2004; Doron-Faigenboim and Pupko 2007). 

The second approach to introduce more biological realism 
into codon models is to use empirical information derived 
from existing databases to estimate the rate of substitutions 
between codons (Schneider et al. 2005; Zoller and Schneider 
2012). It was initially used to align DNA sequences and for 
homology searching but was extrapolated into a full model by 
deriving a substitution rate matrix from the probability matrix 
(Kosiol and Goldman 2005). The advantage of such a model is 
that it encapsulates the full biological variability present in the 
sequences used to build the matrix. However, as in the case of 
amino acid models, the transition probabilities estimated on 
some set of sequences do not necessarily represent the 
changes affecting a particular set of sequences. It has further 
been shown that the estimation of the substitution rates for 
deeply diverged species resulted in empirical codon substitu- 
tions that lacked accuracy due to limitations in the distance 
estimation (Schneider et al. 2005). 

Finally, the third approach is represented by empirical- 
mechanistic models, which complement the simplification 
of mechanistic models with a series of important parameters, 
such as exchangeability rates, that are estimated empirically 
from large databases. An implementation of such empirical- 
mechanistic models is the ECM model (Kosiol et al. 2007), 
which combines the parameters assumed in the mechanistic 
model MO (Goldman and Yang 1994) with physicochemical 
rates of amino acid substitutions estimated from the Pandit 
database (Whelan et al. 2003). It thus allows simultaneous 
substitutions between codons, and it was shown to outper- 
form the previous mechanistic and empirical models in phy- 
logenetic reconstruction (Kosiol et al. 2007). The unrestricted 
ECM model, which involves a large number of free parame- 
ters, was, however, computationally demanding even with 
fast optimization algorithms (Klosterman et al. 2006). 

A reduction in the number of parameters was also 
achieved by combining three existing empirical amino acid 
substitution matrices (JJT, mtRev, and cpRev by Jones et al. 
1992; Adachi and Hasegawa 1996; Adachi et al. 2000, respec- 
tively) with parameters representing the selective pressure 
and the rate of transition to transversion (Doron- 
Faigenboim and Pupko 2007). This model was shown to 
better fit large number of data sets spanning nuclear, viral, 
mitochondrial, and chloroplast genes than mechanistic and 
empirical codon models (Doron-Faigenboim and Pupko 
2007). This approach has led to the development of several 
other empirical-mechanistic models based on amino acid 
propensities (Delport et al. 2010; Rodrigue et al. 2010; De 
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Maio et al. 2013). However, the estimation and interpretation 
of the parameters borrowed from mechanistic approaches is 
difficult because the empirical parameters, which represent 
the known physicochemical attributes or exchangeability 
rates between codons, already incorporate some aspects of 
the substitution process (Anisimova and Kosiol 2009). 

Overall, models of codon evolution are still lacking gener- 
ality and the biological relevance of the current models has 
been questioned (Anisimova and Kosiol 2009). Here, we 
suggest a new mechanistic model of codon substitution. 
The model assumes a nucleotide substitution matrix for 
each nucleotide position in the codon and combines the 
three nucleotide substitution matrices using a matrix opera- 
tor to obtain the corresponding codon substitution rate 
matrix. This fully mechanistic model allows double and 
triple nucleotide substitutions within a codon without 
using any representative empirical data set. The performance 
of our model was assessed by using real and simulated data 
sets. The biological reality added by our model is important 
and has profound effects on the fit of the model to diverse 
data sets tested. It provides a novel direction for further 
extension and should prove useful for generalizing the esti- 
mation of selective forces on molecular data sets. 

New Approaches 

Codon Model Using Kronecker Product 
Codons are coded by three consecutive nucleotides that are 
free to vary and our approach to generalize the model is 
to assume that substitutions occurring within a codon are 
independent. The nucleotides present in the three codon 
positions can therefore change independently and instanta- 
neously. Each nucleotide / within a codon is further modeled 
by a symmetric substitution matrix q, that allows distinct 
rates for each type of substitutions. This is in essence similar 
to the general time-reversible (GTR) substitution matrix 
(Tavare 1986), although the state frequencies are not included 
here (see eq. 3 below): 
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where the rate of change between nucleotides j and k in the 
/th codon position is given by c/l^ and where c/l^ = qf. Note 
that each of the three matrices is not normalized at this stage 
and the 18 parameters (i.e., six per nucleotide positions; see 
below) are free to vary. 

The model of codon evolution (hereafter referred to as 
KCM) is obtained by combining the three matrices at each 
codon position using Kronecker product. The result of 
Kronecker product of the three consecutive 4x4 matrices 
at each codon position is a 64 x 64 matrix (hereafter referred 
to as Kronecker matrix), which represents, after some 
postprocessing described below, the rate transitions between 
any codons based on the underlying substitution rates of the 



nucleotides. The initial matrix includes substitutions from and 
to the three stop codons. We obtained a 61 x 61 Kronecker 
matrix for sense codons by removing the three rows and 
columns representing substitutions from and to the three 
stop codons. The Kronecker matrix is then multiplied by 
the diagonal matrix of the equilibrium frequencies of 
codons IT: 



^lf = (q,(^ (q2 (g) qsW 



(2) 



The substitution matrix that is finally representing how 
codons are changing through time is as follows: 
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where ^ is a 61 x 61 matrix, are substitution rates of 
nucleotide ] to k in the /th position of the codon, and 71^ 
stands for equilibrium frequency of codon m. The frequencies 
71^ can be any type of codon frequencies and can be esti- 
mated empirically from the data as usually done for codon 
models (Yang 2006). 

The KCM model can furthermore be extended to include 
the effect of selection (Goldman and Yang 1994). For every 
codon / and j, the parameter cd, which represents the ratio 
between synonymous and nonsynonymous substitutions, is 
introduced whenever the transition changes the amino acid 
coded by the codons, leading to the final matrix Q: 



0,= 



- co for a nonsynonymous substitution 
for a synonymous substitution 



■ (4) 



Finally, diagonal elements of the Q matrix are fixed to 
ensure that the row sums of Q equal zero and Q is scaled 
to obtain an average rate of substitution at equilibrium equals 
to 1. We therefore count double and triple substitutions as 
single events, which impose that the branch lengths are mea- 
sured in expected numbers of substitutions per codon. 

Comparison of Models and Model Selection 
The KCM model was first compared using sample corrected 
Akaike Information Criterion (AlCc) (Akaike 1974; Hurvich 
and Tsai 1989) with the MO model (Goldman and Yang 1994; 
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Yang 2006) by calculating mean delta AlCc values (repre- 
sented here as AICcmo ~ AICckcm)- The KCM model extends 
the MO model by allowing double and triple substitutions and 
assuming six different substitution rates at each codon posi- 
tion. We do not focus here on the estimation of the selective 
pressure along sequences, which is averaged over sites and 
lineages in the MO model. The main goal is instead to incor- 
porate double and triple substitutions and rate variation 
among positions in codon models. This has, however, the 
consequence to introduce some dependency between nucle- 
otide positions in a codon. The effects of this dependency 
were assessed by restricting the KCM model to include only 
single substitutions per codon by transforming the Q matrix 
with a binary parameter a G {0, 1} defined as follows: 

Qij for one nucleotide substitution, 

Qjj • a for more than one substitution . 

With a = 0, double and triple substitutions are not allowed 
and the KCM model is similar to MO model in this sense. With 
a = 1, the model allows double and triple substitutions and 
becomes more complex than the MO model. 

One of the main concerns in evaluating a model is the 
tradeoff between the number of free parameters and good- 
ness of fit (Burnham and Anderson 2002). The main version 
of KCM includes 19 free parameters (KCMig^; 6 parameters 
for each matrix and co), which is more than the parameters 
of current mechanistic models represented here by MO, which 
includes 2 parameters (/c and co). We investigated the effects 
of the number of parameters in the KCM model by simplify- 
ing the 4 X 4 q, matrices within the Kronecker framework 
(eq. 2; table 1). The simplest version, which is referred to as 
KCMyx, has one q, matrix that is Kronecker multiplied to itself 
three times. This model further allows only single substitu- 
tions per codon by setting a = 0 (KCMy^Mo nnodel; eq. 5). 
The KCMyx is still parameter rich in comparison with MO and 
has seven free parameters (six that are similar to all q, matrices 
and co). Following the same idea, KCMig^ can be modified by 
setting a (eq. 5) to 0, which leads to the KCMig^Mo nnodel. We 
used delta AlCc values to compare between the MO model 
and the different KCM variants. It should be noted that the 
codon frequencies, although estimated empirically from the 
data in all the implementations used here, add between 0 (if 
frequencies are assumed equal) and 60 parameters to the 
models. However, within a given type of codon frequencies. 



Table 1. Different Variants of the KCM Model. 



Model Description 


Number of 


Fixed 




Parameters 


Parameter 


KCM7XM0. r = qf^ 


7 


a = 0 


KCMi9;,Mo. r = (qi (g) qi) 0 q3 


19 


a = 0 


KCM^x, r = qf 


7 


a = 1 


KCMi9x r = (qi (g) q2) 0 qs 


19 


a = 1 


KCMig^neutral T = (^i 0 qi) 0 ^3 


18 


a = 1, (0 = 1 



Note. — The symbol refers to the type of substitution matrix at each nucleotide 
position. 



the increase in the number of parameters is identical for all 
the models compared and have therefore no impact when 
assessing the fit of the different models. 

The KCM model was also compared using AlCc with the 
MEC model (Doron-Faigenboim and Pupko 2007), which is a 
mechanistic-empirical model. To make MEC and KCM 
models more similar, we assumed one selection ratio for all 
sites under MEC model, which forces co = 1 and we called it 
MECneutrai- We Compared it with KCM under the constraint 
that 00 = 1. It was not possible to compare our new models 
with other approaches, such as the singlet-doublet-triplet 
(SDT) model (Whelan and Goldman 2004) and the ECM 
(Kosiol et al. 2007) models, because of difficulties with run- 
ning the existing implementations of these models. 

Finally, we tested whether the nucleotide substitution rates 
obtained under KCM models were the same as the nucleotide 
substitution rates obtained under a GTR nucleotide model. 
For this purpose, we compared the parameters ofq^ qj, and 
^3 obtained under KCM-based models with three GTR model 
rate matrices estimated from three nucleotide sequences that 
are generated by concatenating the first, second, and third 
codon positions. This can be done using the Mgene option of 
the CodeML software (Yang 2007). 

We assessed the performance of the mentioned models on 
5 simulated data sets, 3 known proteins, and 100 randomly 
selected empirical data sets (see Materials and Methods). The 
first simulated data sets were generated based on MO model 
substitution matrix (simulation A), whereas substitution ma- 
trices with double and triple rates randomly drawn from 
normal distributions were added to obtain simulations B, C, 
and D (see Materials and Methods). We also considered a fifth 
data set generated by a substitution matrix estimated from 
the Pandit database (ECM data set; Kosiol et al. 2007). For the 
empirical data sets, we first considered three known proteins 
that included the following: 1) (3-C/ob/>7 sequences containing 
144 codons for 17 vertebrate species extracted from Gen Bank; 
2) rbci sequences from plants containing 447 codons for 20 
species; and 3) pepC sequences also from plants containing 
437 codons for 20 species. We further estimated the different 
models on 100 empirical data sets randomly selected from 
the Selectome database (Moretti et al. 2013) whose co values 
ranged from 0.0054 to 8.66745 as estimated by the MO model 
with F3 X 4 codon frequencies. 

The performance of the different models was assessed on 
all data sets using three types of codon frequencies: F1 /16 (all 
codon frequencies are equal), F3 x 4 (nucleotide frequencies 
are estimates of the three codon positions and combined), 
and F61 (codon frequencies are estimated separately for each 
codon). 

Results 

Simulated Data Sets 

The different variants of the KCM model were first compared 
with the MO model based on different simulation schemes, 
which varied based on the level of double and triple substi- 
tutions present in the data (no double/triple substitutions for 
simulation A to approximately 11%, 30%, and 39%, on 
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average, for simulations B, C and D, respectively). Another set 
of simulations that we called ECM represented approximately 
25% of double and triple substitutions and was created by 
using a rate matrix derived empirically (Kosiol et al. 2007). 

The log likelihood obtained with the F3 x 4 codon 
frequencies for the different KCM models under simulation 
A were, on average, better than the MO model for all values of 
the CO parameter simulated. However, the mean delta AlCc 
differences for co = 1.0 were -18.71, -20.33, -4.16, and 
—5.70 for the difference between MO and KCMig^, 
KCMi9xM0/ KCMyx, and KCMyxMO/ respectively, and the rank- 
ing of delta AlCc did not change qualitatively with different co 
values (table 2). The better fit of the MO model in this case is 
clearly due to the larger number of parameters in each of the 
KCM variants. 

The simulations with no double/triple substitutions re- 
sulted in estimated co values that were overall similar to the 
expected ones (table 3). There is, however, a tendency for 
large co values to be overestimated and small co values to 
be underestimated when the model considered double and 
triple substitutions (i.e., KCMy^ and KCMig^ for simulation A; 
table 3). We therefore applied a correction to the co param- 
eter for the KCMi9x model because of the variation in rates 
among nucleotide positions within codons in this model and 
the allowance of double and triple substitutions (eqs. 6 and 7; 
see also Goldman and Yang 1994). Although, this correction 
did not change the pattern observed (corrected co values for 
KCM19;, under simulation A were 0.114, 0.543, 1.137, 2.308, 
and 14.194, for 00 = 0.1, 0.5, 1.0, 2.0, 10.0, respectively), 
the variance in co values obtained always included the ex- 
pected value simulated. 



The pattern obtained is very different once some 
amount of double and triple substitutions are introduced 
and drastic increase in log-likelihood values were observed, 
reflected in the mean delta AlCc, for the variants of the 
KCM model (simulations B-D and ECM; table 2). This was 
irrespective of the amount of double and triple substitutions 
present in the data. The different KCM variants clearly out- 
performed the MO model with a difference in mean delta 
AlCc for KCM19;, and KCM7;, between 214.367 and 223.405 
(simulations B-D and ECM; table 2). Forcing the KCM models 
to ignore any substitutions implying double and triple 
substitutions (i.e., KCMig^Mo and KCM7;^mo) resulted in log 
likelihood that were comparable with the MO model (simu- 
lations B-D and ECM; table 2) and mean delta AlCc be- 
tween -9.868 and -3.5925 (simulations B-D and ECM; 
table 2). 

The large differences seen when comparing, on one hand, 
KCMi9x with KCMi9xMo and, on the other hand, KCMy^ with 
KCMi9xKCM7xMo indicate that the number of parameters in 
itself was not responsible for the large improvement of the 
delta AlCc observed with models allowing double and triple 
substitutions. 

Changing the type of codon frequencies from F3 x 4 to F 
1/61 had a drastic effect on the fit of the models when 
we analyzed the simulations with low amounts of double 
and triple substitutions (simulation A; supplementary fig. SI, 
Supplementary Material online). Under equal codon frequen- 
cies, the mean delta AlCc of KCM19X became much better 
than MO (from —18.71 to 2.07 for 00 = 1), whereas the mean 
delta AlCc of KCMy;^ changed from —4.16 to —2.75, again for 
00 = 1, even though this data set was simulated based on the 



Table 2. Mean Delta AlCc (standard deviation) Over the 50 Replicates for the Different Simulations Performed 



Simulations 


Model 






Mean Delta AlCc 










Factor = 0.1 


Factor = 0.5 


Factor = 1.0 


Factor = 2.0 


Factor = 10.0 


A 


KCMi9x 


-20.84(6.69) 


-20.39(6.38) 


-18.71(7.50) 


-20.23(6.42) 


-17.98(7.84) 




KCMyxMO 


-20.36(6.33) 


-21.77(5.94) 


-20.33(6.94) 


-21.55(6.05) 


-18.66(7.96) 




KCM7X 


-5.77(3.03) 


-4.14(4.55) 


-4.16(4.81) 


-4.35(4.51) 


-5.18(3.52) 




KCM7XM0 


-5.82(3.20) 


-5.37(3.57) 


-5.70(3.79) 


-5.75(3.85) 


-6.14(3.32) 


B 


KCMi9x 


3.78(18.35) 


40.60(23.19) 


44.49(21.63) 


55.23(26.96) 


65.80(29.51) 




KCM7XM0 


-22.04(4.84) 


-20.22(4.73) 


-19.81(5.42) 


-19.46(5.70) 


-19.97(6.11) 




KCMy^ 


18.63(17.77) 


55.87(22.88) 


59.94(21.80) 


70.63(27.70) 


79.54(28.93) 




KCMyxMO 


-8.00(1.74) 


-5.71(3.07) 


-4.61(3.31) 


-4.23(3.26) 


-5.43(3.04) 


C 


KCMi9x 


134.83(35.01) 


210.45(38.49) 


243.46(37.59) 


260.84(42.86) 


302.18(45.76) 




KCMi9xM0 


-14.50(7.17) 


-20.06(4.60) 


-20.41(5.54) 


-17.94(7.72) 


-21.78(6.15) 




KCM7X 


146.46(34.20) 


226.87(38.12) 


259.05(37.04) 


273.08(43.08) 


316.93(44.27) 




KCM7XM0 


-7.32(2.98) 


-6.04(2.68) 


-5.87(3.72) 


-6.13(2.84) 


-5.64(2.99) 


D 


KCMi9x 


213.24(52.33) 


320.78(57.01) 


343.11(46.81) 


407.34(55.21) 


449.84(53.37) 




KCMi9xM0 


-11.48(32.64) 


-14.77(27.79) 


-18.55(6.76) 


-20.07(7.67) 


-21.28(5.73) 




KCM7X 


227.22(53.05) 


336.20(57.97) 


357.53(46.39) 


421.73(55.51) 


466.92(52.47) 




KCM7XM0 


-2.30(30.35) 


-2.35(26.98) 


-6.80(2.48) 


-7.04(2.73) 


-5.65(3.76) 


ECM 


KCMi9x 


93.45(17.80) 


164.37(31.97) 


235.62(43.36) 


299.15(42.84) 


398.78(35.68) 




KCMi9xM0 


12.19(9.03) 


9.31(9.84) 


13.79(12.77) 


17.14(11.68) 


32.55(13.43) 




KCM7X 


98.15(18.31) 


162.18(28.51) 


229.84(41.43) 


287.54(39.80) 


373.72(34.05) 




KCM7;,M0 


0.40(4.65) 


-0.83(5.35) 


-0.38(4.26) 


0.34(4.66) 


11.74(6.67) 



Note. — The analysis is reported for F3 x 4.The term factor refers to the constant used to multiply the co parameter in the different simulations. 
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Table 3. Mean m (standard deviation) over the 50 Replicates for the Different Simulations Performed. 



Simulations Model Simulated Parameter 







Factor = 0.1 


Factor = 0.5 


Factor = 1 .0 


Factor = 2.0 


Factor = 1 0.0 


A 


MO 


0.110(0.013) 


0.520(0.054) 


0.978(0.104) 


2.030(0.224) 


8.434(1.440) 




KCMiQv 


0.114(0.023) 


0.543(0.136) 


1.137(0.336) 


2.308(0.516) 


14.194(9.335) 




KCMfg^^O 


0.115(0.023) 


0.540(0.133) 


1.125(0.332) 


2.233(0.469) 


10.240(2.918) 




KCM7V 


0.108(0.015) 


0.496(0.061) 


0.978(0.117) 


2.182(0.280) 


13.456(8.282) 




KCM7«AAn 


0.110(0.014) 


0.512(0.054) 


0.980(0.109) 


2.042(0.227) 


8.458(1.453) 


B 


MO 


0.132(0.014) 


0.549(0.066) 


1.027(0.137) 


1.724(0.407) 


4.911(1.710) 




KCMiQv 


0.102(0.030) 


0.425(0.092) 


0.839(0.161) 


1.584(0.376) 


14.007(15.299) 




KCMi9x/\/\0 


0.108(0.034) 


0.426(0.109) 


0.811(0.176) 


1.393(0.510) 


4.228(1.977) 




KCM^v 


0.109(0.013) 


0.490(0.069) 


0.998(0.144) 


1.917(0.432) 


20.627(35.725) 




KCM-7wA/in 


0.132(0.015) 


0.551(0.074) 


1.025(0.148) 


1.706(0.425) 


4.899(1.745) 


c 


MO 


0.185(0.017) 


0.607(0.067) 


0.986(0.148) 


1.407(0.200) 


2.524(0.536) 




KCMiQv 


0.079(0.020) 


0.365(0.071) 


0.833(0.286) 


1.432(0.467) 


10.494(5.773) 




KCMi9xM0 


0.108(0.032) 


0.388(0.076) 


0.733(0.266) 


1.016(0.350) 


2.045(0.944) 




KCM-7V 


0.096(0.014) 


0.448(0.063) 


0.948(0.201 ) 


1.765(0.461) 


14.205(8.130) 




KCM-7wA/in 


0.185(0.018) 


0.618(0.076) 


1.002(0.152) 


1.420(0.230) 


2.588(0.524) 


D 


MO 


0.241(0.198) 


0.845(1.017) 


1.016(0.150) 


1.301(0.202) 


1.692(0.244) 






0.078(0.019) 


0.405(0.103) 


0.744(0.161) 


1.632(0.623) 


9.559(6.274) 






0.120(0.037) 


0.427(0.103) 


0.591(0.168) 


0.849(0.257) 


1.270(0.629) 




KCM7X 


0.091(0.013) 


0.483(0.090) 


0.960(0.216) 


2.001(0.731) 


11.149(7.188) 




KCMyxMO 


0.209(0.025) 


0.706(0.092) 


1.036(0.166) 


1.334(0.221) 


1.687(0.254) 


ECM 


MO 


0.026(0.003) 


0.130(0.011) 


0.241(0.026) 


0.443(0.067) 


1.555(0.383) 




KCMis,x 


0.006(0.003) 


0.034(0.008) 


0.066(0.015) 


0.122(0.022) 


0.633(0.163) 




KCMi9xM0 


0.009(0.005) 


0.052(0.014) 


0.102(0.031) 


0.192(0.044) 


0.698(0.240) 




KCM7X 


0.008(0.002) 


0.053(0.006) 


0.099(0.016) 


0.194(0.026) 


1.108(0.296) 




KCM7XM0 


0.026(0.003) 


0.123(0.011) 


0.226(0.025) 


0.427(0.064) 


1.460(0.359) 



Note. — The values given for the KCMig^ and KCMy^ models are the uncorrected ones (see text). The analysis is reported for F3 x 4. The term factor refers to 
the constant used to multiply the co parameter in the different simulations. 



Table 4. Estimated Log Likelihood (In L), AlCc, and oj Parameters (corrected for the KCMigx model) for Vertebrate P-Globin and the Plants rbcL 
and pepC Genes. 



Models 




P'Globin 






rbcLlO 






pepC20 




-InL 


AlCc 


n 


-InL 


AlCc 


0} 


-InL 


AlCc 


0} 


MO 


3,815.5 


7,635.1 


0.23685 


4,362.7 


8,729.4 


0.10116 


9,783.4 


19,571 


0.06597 


KCM7XM0 


3,799.9 


7,614.1 


0.20640 


4,336.70 


8,687.5 


0.08671 


9,734.95 


19,484 


0.06093 


KCMi9xM0 


3,710.5 


7,460.8 


0.1409 


4,301.86 


8,642.3 


0.0832 


9,595.99 


19,231 


0.0443 


KCM7X 


3,694.78 


7,403.8 


0.11417 


4,297.31 


8,608.7 


0.06580 


9,541.91 


19,098 


0.04428 


KCMi9x 


3,601.16 


7,242.2 


0.0706 


4,263.39 


8,565.4 


0.0652 


9,367.98 


18,775 


0.0291 


'^ECneutral 


3,840.4 


7,697.9 


1.129 


4,557.3 


9,130.9 


1.189 


10,504.9 


21,026.9 


1.139 


KCMi neutral 


3,659.9 


7,361.3 


1.000 


4,328.5 


8,693.5 


1.000 


9,649.0 


19,335.0 


1.000 



substitution matrix of MO itself. This result suggests that the 
MO and KCMy^ models, which do not allow rate variability 
within codon positions, depend heavily on the codon frequen- 
cies to compensate the model restrictions. When the amount 
of double and triple substitutions increases further (simula- 
tions B, C D, and ECM), the fit of the KCM variants is much 
higher than MO regardless of the frequency mode. This sug- 
gests that the ability of the KCM models to incorporate 
double and triple substitutions becomes much more relevant 
than the rate variation within codons (fig. 1 and supplemen- 
tary figs. SI and S2, Supplementary Material online). The 



results for the F61 codon frequencies are similar to those for 
F3 X 4 and are therefore not shown. 

The estimated co parameters under the simulation 
schemes B-D and ECM were correctly estimated by 
KCMi9x and KCMy^ models, whereas a strong bias was intro- 
duced when the models did not allow double and triple sub- 
stitutions (table 4). This bias is either an overestimation, when 
the data are simulated under purifying selection 
(00 = 0.1, 0.5) or a strong underestimation in the presence 
of positive selection (co = 2.0, 10.0). Although the exact co 
value for the ECM simulations is not known in advance 
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— I 1 1 r 1 1 1 r 

0.02 0.14 1.00 7.39 0.02 0.1 4 1.00 739 

KCM19X KCM19){M0 




0.02 0.14 1.00 739 0.02 0.14 1.00 7.B9 



KCM7X KCIWZxMO 

Fig. 1. Delta AlCc plots comparing the performance of the MO model 
with KCM models (KCMy^, KCMyxMo. KCMig^, KCMig^Mo) on 100 em- 
pirical data sets randomly selected from Selectome database. For each 
empirical data set, we evaluated the maximum-likelihood value of the 
MO model and the KCM variants and compared the delta AlCc to 
penalize the 2 free parameters of the MO model and the 7, 7, 19, and 
19 free parameters of the KCM models, respectively. For each plot, a 
black horizontal line is drawn for the mean delta AlCc value of the 
empirical data set. The codon frequencies used were the products of the 
observed nucleotide frequencies at each of the three codon positions 
(f3 X 4; Yang and Bielawski 2000). Empirical data sets with delta 
AlCc < 4 are shown in red 



(Kosiol et al. 2007), we observed that the co value estimated 
by the KCAAig^ and KCMy^ models increased linearly with 
the factor that we applied to co, whereas the MO model 
was not sensitive to this factorization of the co parameter 
(table 4). 

The model comparisons showed that KClsA^^x is able to 
outperform standard models of codon evolution. However, 
these evaluations do not inform about the accuracy of the 
parameters estimated by our new model and their biological 
relevance. This was assessed by simulating data sets under 
either the MO or ECM models and comparing the rates of 
substitutions of codons obtained with the MO, KCMig^, and 
ECM models. The distribution of the median values over the 
50 simulations for each entity of the 61 x 61 rate matrix 
showed that KCMig^ substitution rates can approximate 
either the MO (supplementary fig. S3, Supplementary 
Material online) or the ECM (supplementary fig. S4, 
Supplementary Material online) model. In contrast, ECM 
and MO are not able to represent as well the substitution 
rates of MO and ECM, respectively. Indeed, the Euclidian dis- 
tances measured on the entities of the 61 x 61 matrix were 
smaller for KCMig^ than for any of the other alternative 
models (i.e., 1.005 for KCMig^ versus 2.281 for ECM on simu- 
lations based on the MO model; 1.812 for KCM19;, versus 2.789 
for MO on simulations based on the ECM model). Moreover, 
the branch lengths estimated by KCMig^ are correlated with 
the ones obtained by the model used for the simulations (0.69 
with MO and 0.90 for ECM). 



For computational simplicity, we simulated data sets that 
contained only 150 codons alignments. This could have an 
impact on the estimation of the parameters of the models 
and we therefore checked for simulations A, B, and ECM if our 
estimation of the co parameter (table 4) was comparable with 
simulations done on 3,500 codons. The median of the co was 
the same for both alignment lengths and a reduction in the 
variance of the parameter estimates was observed with larger 
alignments (supplementary table SI, Supplementary Material 
online). The use of 150 codon alignments should, therefore, 
not affect our main conclusions. 

Empirical Data Sets 

The performance of the KCM model and its extensions on the 
three empirical data sets are shown in table 4. In each case, 
the best model, as measured with the AlCc criteria, was 
KCMi9x, which allows both a different substitution matrix 
per codon position and double and triple substitutions to 
occur (table 1). Overall, the KCM variants outperformed 
the MO model, when allowing selective pressure to be esti- 
mated, or MECneutrai when CO was fixed to 1 in KCM (table 4). 

For the ^-g/ob/n gene, the AlCc for the KCMy^Mo model 
was lower than the MO model by 21.0. Extending the 
KCMyxMo nnodel to allow variable substitution rates among 
nucleotide positions within a codon (i.e., KCMig^Mo) further 
improved the AlCc by 153.30, whereas allowing double and 
triple substitutions (i.e., KCMy^) resulted in a reduction in 
AlCc of 210.30 (table 4). Combining these two aspects into 
the more general KCMig^ model clearly gave the best AlCc 
value for the ^-g/ob/V? data sets and resulted in a reduction of 
KCMyxMo AlCc by 371.90. The same trend was observed for 
the two plant data sets tested here (table 4). 

The P'globin gene was also used to compare the three GTR 
matrices estimated by partitioning the sequences into inde- 
pendent codon positions (Mgene option in CodeML) and the 
q, matrices obtained from the KCMy^Mo and KCMig^ models. 
The substitution rates estimated by KCMig^Mo were very sim- 
ilar to those estimated for each codon positions and the 
mean relative error per matrix entities was equal to 5.8% 
with no obvious distinction between codon positions. 
Allowing double and triple substitutions by using the 
KCMi9x model changed the pattern, and the mean relative 
error per entities increased to 22.6%. The largest discrepancies 
observed between the KCMig^ matrices and the three in- 
dependent GTR matrices were found at the first position, 
followed by the third codon positions (data not shown), 
whereas the values of the rate matrix for the second position 
were highly similar between the two models. 

The importance of incorporating double and triple substi- 
tutions can be further understood by looking at the substi- 
tution rates found in each 61 x 61 matrix. Double and triple 
substitution rates estimated under KCMig^ ranged from 
<0.001 to 0.145 for ^-globin, from <0.001 to 0.121 for rbcL, 
and from < 0.001 to 0.272 for pepC. A graphical representa- 
tion of the rate matrices estimated by KCMig^ for ^-globin 
is shown in the supplementary fig. S5, Supplementary 
Material online. For each empirical data sets, the number 
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of double and triple substitutions with rates > 0.001 is 
high (44.2%, 28.3%, and 40.0% for ^-giobin, rbcL, and pepQ 
respectively), and there are several double and triple 
substitutions with higher substitution rates than single sub- 
stitutions (supplementary fig. S5, Supplementary Material 
online). 

In contrast to the simulated data sets, the estimated ratio 
of nonsynonymous versus synonymous substitutions (co) was 
smaller under the different KCM models than with MO for all 
empirical data sets tested. The incorporation of more com- 
plexity in the KCM variants lead to a sharp reduction in the 
estimated co values for each data sets tested (table 4). 
Allowing both rate variation among codon positions and 
multiple nucleotide substitutions per codons (i.e., KCMig^; 
table 4) reduced the estimated co by half or more in all em- 
pirical data sets. This is similar to what was observed for the 
simulations B-D under low co values (table 4) and would 
suggest that double and triple substitutions are present in 
those empirical data sets. 

The MO model is certainly making strong assumptions that 
might not be relevant for the empirical data tested here and 
the improvements seen with the KCM variants might there- 
fore be expected. The MEC model is taking another route to 
achieve biological realism, but our results show that the 
KCMi9;^ neutral HTiodel is also fitting the different empirical 
data sets much better (table 4). For instance, the reduction 
in AlCc obtained with KCM^g^^ neutral on the l3'globin gene 
reached 336.6 over the MECneutrai nnodel (table 4). Again, 
similar reductions in AlCc were observed for the two plant 
data sets (table 4). 

The Almost Invariant Sites (AIS) analyses on the two plant 
data sets illustrate the biological consequences of the substi- 
tution matrix of codons obtained by the KCM model. The 
creation of 20 codon classes resulted in a clear assemblage of 
codons into their amino acids and a high similarity between 
the clusters was found by analyzing either KCMig^ or MO 
substitution matrix (table 5). This was particularly true for 
the rbcL data set, whereas the pepC data set showed slight 
differences (e.g., amino acids phenylalanine [F] and isoleucine 
[I] are grouped together with KCM but not with MO; table 5). 
Similar results were obtained when the partitioning was done 
based on the seven physicochemical properties of amino 
acids (table 6). 

Finally, we compared the KCM and MO models on 100 
empirical data sets randomly selected from the Selectome 
database (Moretti et al. 2013). We showed that KCM variants 
outperform the other models in all but two data sets (fig. 1 
and supplementary fig. S6, Supplementary Material online). 
Furthermore, the estimation of the average rate of transitions 
and transversion by KCMig^ and KCMy^was very close to the 
MO model (supplementary fig. S7, Supplementary Material 
online), which suggests that KCM models can capture 
biologically relevant aspects of the evolution of these pro- 
tein-coding genes. For F1/61 codon frequencies, the mean 
value of delta AlCc for the 100 empirical data sets evaluated 
with the KCMi9x model was 151.98, whereas it was 53.05 for 
KCMyx (supplementary fig. S8, Supplementary Material 
online). These values changed when we used the F3 x 4 



codon frequencies with mean delta AlCc increasing to 
214.41 and 111.84 for KCM19;, and KCM7;,, respectively 
(fig. 1). The results for the F61 codon frequencies are similar 
to those for F3 x 4 and are not shown. 

Discussion 

We proposed a new model of codon evolution that incorpo- 
rates rate variation within codon positions and allows double 
and triple substitutions between codons. It represents an at- 
tempt to capture the real processes behind protein-coding 
sequence evolution and generalizes the current mechanistic 
models without relying on any empirical data. 

A fully parametric model of codon substitution defined by 
a 61 X 61 rate matrix would require the estimation of 1,830 
parameters if the model is assumed symmetrical. The KCM 
model described here reduces this parameter space through 
the use of the Kronecker product of three consecutive 4x4 
nucleotide substitution matrices, that is, q^, q2, and (eq. 3). 
This allows the substitution process to be modeled by only 18 
rate parameters and one selection parameter co. The param- 
eters of the q, matrices represent the contribution of the 
corresponding nucleotide substitutions to the evolution of 
codons. However, the resulting substitution rates of codons 
are not simply multiplications of substitution rates of the 
three corresponding nucleotide substitution rate matrices 
as would be done by assuming that the three positions 
evolved independently. Dependency between codon posi- 
tions is well-known (Robinson et al. 2003; Whelan 2008), 
and our model incorporates some of this dependency 
during the building of the codon substitution matrix through 
the Kronecker product. In contrast to context-dependent 
models (Baele et al. 2011) or to the use of codon partitions 
(like the option Mgene in CodeML), our approach maintains 
the codon as the unit of evolution and, in addition to nucle- 
otide substitution parameters, equilibrium frequency of 
codons, that is, F61, F3 x 4 model, and the branch lengths 
play roles in estimating the substitution rates of codons. 
However, our model cannot incorporate the effect of neigh- 
boring bases on the substitution process within codons 
(Fedorov et al. 2002; Morton and Wright 2007). 

The KCM-based models explained the three real data sets 
better than the MO model considered in this article. Our 
results highlight that allowance of double and triple nucleo- 
tide substitutions in a codon at each time interval is an im- 
portant aspect to model the evolutionary process underlying 
protein-coding sequence data. This is in line with a recent 
study that showed that the most relevant parameters for 
codon models are co and double and triple substitutions 
(Zoller and Schneider 2010). On the other hand, the 
KCMi9xNeutrai Tiodel explained the empirical data sets 
better than the MECneutrai nnodel, even though the MEC 
model uses empirical data and also allows double and triple 
substitution. This suggests that rate variation among nucleo- 
tides within a codon, which is a component present in our 
KCM model but not incorporated explicitly in the MEC 
model, is another relevant aspect to consider beside those 
proposed by Zoller and Schneider (2010). Our results show 
the utility to generalize codon models of substitution and 
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Table 6. Partitions of Codons into Seven Categories Based on Substitution Rate Matrix of Two Genes Obtained Under KCMig^ and MO Models 
Using AIS Algorithm. 



Genes 


AIS Analysis 




KCM 


MO 


Sets for 
rbcL 


{F (TTT) F (TTC) L (TTA) L (TTG) S 

(TCT)S (TCC)S (TCA)S (TCG)L (CTT)L 
(CTC)L (CTA)L (CTG) P (OCT) P (CCC) P 
(CCA)P (CCG)I (ATA)V (GTT)V (GTC)V 
(GTA) V (GTG) } 

{Y (TAT)Y (TAC)H (CAT)H (CAC)N (AAT)N 

(AAC)D (GAT)D (GAG) } 
{C (TGT)G (TGG)W (TGG)S (AGT) S (AGG)G 

(GGT)G {GGC)G (GGA)G (GGG) } 
{Q (GAA)Q (GAG)K (AAA)K (AAG) } 

{R (GGT)R (GGG)R (GGA)R (GGG)T (AGT) T 
(AGG)T (AGA)T (AGG)R (AGA) R (AGG)A 
(GGT) A (GGG)A (GGA)A (GGG)} 

{I (ATT)I (ATG)M (ATG) } 

{E (GAA)E (GAG) } 


{F (TTT) F (TTG) L (TTA)L (TTG) S 

(TGT)S (TGG)S (TGA)S (TGG)L (GTT)L 
(GTG)L (GTA)L (GTG) N (AAT)N (AAG)K 
(AAG)S (AGT)R (AGA)D (GAT)D (GAG)E 
(GAA)E (GAG)G (GGT)G {GGC) G (GGA)G 
(GGG) } 

{Y (TAT)Y (TAG)G (TGT)G (TGG)H (GAT)H 

(GAG) S (AGG) } 
{W (TGG) V (GTG) } 

{Q (GAA)Q (GAG)R (GGA)R (GGG)K (AAA)R 
(AGG) } 

{P (GGT) P (GGG)P (GGA)P (GGG)R (GGT)R 
(GGG) } 

{I (ATT)I (ATG)I (ATA)M (ATG)} 

{T (AGT) T (AGG)T (AGA) T (AGG)V (GTT)V 

(GTG)V (GTA)A (GGT) A (GGG)A (GGA)A 

(GGG) } 



{F (TTT) F (TTG) L (TTA) S (TGT) S 

(TGG)S (TGA)S (TGG) I (ATT) I (ATG) I 
(ATA)T (AGT) T (AGG)T (AGA)T (AGG)V 
(GTT)V (GTG)V (GTA)V (GTG)} 

{L (TTG)L (GTT)L (GTG)L (GTA)L (GTG) P 
(GGT) P (GGG)P (GGA)P (GGG)M (ATG)} 



{Y (TAT)Y (TAG)G (TGT)G (TGG)S (AGT)S 
(AGG) } 

{W (TGG)R (GGT)R (GGG)R (GGA)R (GGG)R 
(AGA)R (AGG)G (GGA)G (GGG)} 

{H (GAT)H (GAG)Q (GAA)Q (GAG)N (AAT)N 
(AAG)K (AAA)K (AAG)D (GAT)D (GAG)E 
(GAA)E (GAG) } 

{A (GGT) A (GGG)A (GGA)A (GGG)} 

{G (GGT)G (GGG) } 



{F (TTT) F (TTG) L (TTA)L (TTG)L 

(GTT)L (GTG)L (GTA)L (GTG) H (GAT)H 
(GAG)R (GGG)R (GGA)R (GGG)M (ATG) V 
(GTA) V (GTG) } 

{S (TGT)S (TGG)S (TGA) S (TGG) P (GGT) P 
(GGG)P (GGA)P (GGG)T (AGT) T (AGG)T 
(ACA)T (AGG)V (GTT)V (GTG) A (GGT) A 
(GGG) A (GGA) A (GGG) } 

{Y (TAT)Y (TAG)G (TGG)N (AAT)N (AAG)D 
(GAT)D (GAG)E (GAA)E (GAG)G {GGC) G 
(GGG) } 

{G (TGT)R (GGT)S (AGG)G (GGT)G (GGA)} 
{I (ATT) I (ATG) I (ATA)} 



{W (TGG) S (AGT) } 

{Q (GAA)Q (GAG)K (AAA) K (AAG)R (AGA) R 
(AGG) } 



suggest that the KCM-based models that we propose are able 
to incorporate the most important factors acting on the evo- 
lution of protein-coding genes (tables 2 and 4). 

Our KCM models, in particular the KCAAig^, are more 
parameter rich than MO and MECpgutrai models. 
Overparameterization can therefore be considered as a 
possible explanation for the better fit observed in our re- 
sults. We have chosen AlCc measures in our model com- 
parisons to reduce the risk that the decrease in likelihood 
of the KCM models could be the result of a larger number 
of parameters. Simulation A gives some insight into the 
effect of overparameterization in our models. Because the 
data were simulated under the MO model, all codon posi- 
tions have the same substitution rate matrix and no dif- 
ferentiation between nucleotide positions is introduced. In 
this case, one would expect the delta AlCc values for the 
KCMi9x model to be lower than or at least very similar to 
those for KCMy^. 



The different variants of the KCM models allow, however, a 
better exploration of the effects of the increased number of 
parameters. In particular, comparisons between the KChA^^x 
or KCMyx and their respective variants allowing only single 
substitutions (i.e., KCMig^Mo and KCMy^MO/ respectively), 
which have the same number of parameters, suggest that 
the effect of the number of parameters, although present, is 
minimized (tables 2 and 4). The changes in likelihood values 
are due to the allowance of double and triple substitutions in 
the substitution process. Our results concur with other stud- 
ies (Whelan and Goldman 2004; Doron-Faigenboim and 
Pupko 2007; Kosiol et al. 2007) and reinforce the need to 
better explore the effects of biological realism in models of 
codon evolution. 

Although different KCM variants obtained better AlCc 
values when compared with other models, it is important 
to understand whether the codon substitution rate matrix 
obtained by our models is biologically meaningful and 
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represents the substitution patterns expected between amino 
acids. This was checked by first extracting the codon substi- 
tution rate matrix obtained under the KCAAig^ model for the 
l3'globin gene and generating a schematic representation of 
the matrix (supplementary fig. S5, Supplementary Material 
online). It had a very similar configuration than the published 
representations (Kosiol et al. 2007; De Maio et al. 2013), which 
suggest that the distribution of codon substitutions in the 
KCMi9x model is biologically plausible. Second, the biological 
relevance of the codon substitutions was assessed by the AIS 
software (Lio and Goldman 1998) applied, again, to the 
KCMi9x codon substitution rate matrix obtained from the 
two plant data sets (tables 5 and 6). The small variations 
observed between real codon sets and partitions obtained 
under the KCAAig^ model was also seen under the MO 
model. This suggests that model assumptions are not causing 
these deviations, but that it is either due to the nature of the 
data sets or to some common assumptions of mechanistic 
models of codon evolution. Moreover, by comparing KCM 
substitution rates with standard models when the data are 
simulated under MO and ECM models, we showed that KCM 
not only outperforms standard models but that it also cor- 
rectly estimates the parameter values. 

Beside the ability to account for double and triple substi- 
tutions, the KCMi9x variant of our model can also capture the 
rate variability existing between the three positions within a 
codon. There has been several attempts to incorporate the 
variability among sites in codon models (Pond and Muse 
2005; Mayrose et al. 2007; Rubinstein et al. 2011) and to 
allow both synonymous and nonsynonymous rates to vary. 
However, only empirical models that estimate exchangeability 
rates from existing data have been extending this variability to 
different positions within a codon (Kosiol et al. 2007). Here, 
we provide further evidence that this variation might be im- 
portant and that an accurate modeling of protein evolution 
should go beyond the simple consideration of synonymous 
versus nonsynonymous changes. For example, the investiga- 
tion of multilayered selective pressure (Rubinstein et al. 2011) 
that model the level of selection at the protein and DNA/ 
RNA levels is of interest in this context. 

In our results, the difference between AlCc values of MO 
and KCMyx captures the model ability to take into account 
the double and triple substitutions. The difference between 
the AlCc values of KCMy^ and KCMig^, that both incorpo- 
rates double and triple substitutions, shows the model ability 
to take into account the variability within the codon posi- 
tions. The advantage of the KCMig^ is already evident in the 
ECM simulations (table 2), but this effect is increased drasti- 
cally for empirical data sets (fig. 1). The better performance of 
KCMi9x over the restricted model KCMy^ is observable for all 
types of codon frequencies and especially with F3 x 4 in the 
ECM simulations and empirical data sets, which show an 
increased variability between the rate matrices of the three 
codon positions over the simulations A-D (between 10% and 
30% more difference observed; data not shown). This has 
already been shown in other empirical data sets (Bofkin 
and Goldman 2007) and highlights the importance of 
correctly estimating the codon frequencies in such models 



(Aris-Brosou and Bielawski 2006). This result further suggests 
that the rate variability within codons captured by KChA^^x is 
modeling important aspects of protein-coding gene 
evolution. 

The variability within codon positions is in part associated 
with the codon usage bias. For example, the difference be- 
tween the AlCc values of KCMi9x and KCMy^ were highly 
dependent on the type of codon frequencies used (supple- 
mentary fig. S8, Supplementary Material online). This indi- 
cates that KCMi9x can incorporate codon usage bias 
through the three different rate matrices used to build this 
model. Therefore, this ability can rescue the model even when 
we assumed that the codon frequencies were equal. However, 
the better fit of KCMi9x remains even under the F3 x 4 type 
of codon frequencies, which indicates that other factors are 
affecting the amount of rate variability observed within codon 
positions (fig. 1 ). One of these factors could be selective pres- 
sure, which will push for substitution rates to differ between 
first, second, and third positions (Nei and Gojobori 1986; 
Goldman and Yang 1994). It would be important to further 
study this aspect to better understand the process behind the 
evolution of protein-coding evolution, but this is beyond the 
scope of this study. 

One of the advantages of mechanistic models is that their 
parameters are defined based on biological processes (Yang 

2006) and allow a direct test of the relevance of these param- 
eters through AlC or likelihood-based measures. In the case of 
KCM models, the parameters of the matrices are borrowed 
from the GTR model of nucleotide substitution (Tavare 
1986). However, we estimate them from codon data and 
not directly from nucleotide data and the biological meaning 
of the parameters included in q, is not straightforward. The 
other approach to incorporate double and triple substitutions 
in a mechanistic codon model (Whelan and Goldman 2004) 
did not have the same interdependency between codon 
positions that the Kronecker product introduces. Their rate 
matrices were therefore formed by rescaled single, double, 
and triple instantaneous substitution rates that are compa- 
rable with the events occurring in nucleotide model (Whelan 
and Goldman 2004). The effect of the interdependency in the 
KCM model is readily seen in the comparison between the 
partitioned codon data analyzed with the Mgene option of 
CodeML and the full estimation under the KCMi9xmo and 
KCMi9x models. The parameters of the former model were 
very similar to the parameters obtained in the three GTR 
substitution matrices, whereas the deviation increase to 
about 22% with the latter especially in the matrices defining 
the first and third codon positions. This suggests that the 
parameters of the q matrices in KCMi9x are nucleotide sub- 
stitution parameters in protein-coding area under codon con- 
straints, which can be compared with the exchangeability 
rates or replacement probabilities used in empirical codon 
models (Doron-Faigenboim and Pupko 2007; Kosiol et al. 

2007) . 

The mechanistic codon model that we present here has 19 
free parameters, which are combined to represent the full set 
of codon transition rates. It can therefore consider single, 
double, and triple nucleotide substitutions per codon 
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within a small interval of time and allows the modeling of rate 
variation between codon positions. The codon transition 
rates estimated by KCAAig^ are biologically relevant and 
our new model can have a better fit than current models 
on simulated and empirical data sets. However, the estima- 
tion of CO by KCM models slightly differ from the MO and ECM 
models. The estimation of the selective pressure when 
double and triple substitutions are incorporated is more com- 
plex (Kosiol et al. 2007; De Maio et al. 2013) and applying 
existing correction to the estimated co is not yet satisfying. 
Further investigations should be done to clarify the relation- 
ship between the co value estimated by KCMi9x and the co 
values of the standard model (MO) and empirical models 
(ECM). 

Conclusion 

The KCM model is an attempt to generalize mechanistic 
models of codon evolution. It uses a mathematical operator, 
the Kronecker product, to increase the number of effective 
parameters, which allows the inclusion of double and triple 
nucleotide substitutions. We show that the KCM model can 
lead to improvements of the likelihood when compared with 
traditional models of codon evolution. This has consequences 
on key parameters used to describe the evolution of protein- 
coding sequences. In particular, our simulations suggest that 
the effect of double and triple substitutions can be important 
for the identification of selective pressure. It is evident that 
assuming a single co value for all sites and branches has now 
been shown to be unrealistic (Zhang et al. 2005; Anisimova 
and Kosiol 2009); we can suppose that the biases that we 
observed when data with double and triple substitutions are 
analyzed under the MO model will be maintained. This is 
clearly calling for further studies to understand the potential 
extent of such bias and the extension of the KCM model that 
we proposed could represent one possibility. 

Materials and Methods 

Correcting Synonymous and Nonsynonymous Ratio 
The rate variation across nucleotide sites in a protein-coding 
region is usually assumed to be due to selection pressure, and 
this is modeled by the co parameter (Yang 2006). However, 
the KCM model allows for rate variation across nucleotide 
positions in a codon by considering a separate GTR matrix for 
each nucleotide position. The estimation of co as described in 
equation (4) is therefore biased and a correction has to be 
applied. To do so, we first estimate the synonymous substi- 
tution rate per codon (Goldman and Yang 1994) 

61 61 

/■= 1 j= 1 Jj^i , aa, =aaj 

Similarly, the nonsynonymous rate per codon can be 
calculated by summing njQjj over all codons /, j coding for 
different amino acids. Then, the synonymous and nonsynon- 
ymous rate per codon parameters and are estimated 
under neutral selection (Nei and Gojobori 1986; Goldman 



and Yang 1994) by using the same nucleotide substitution 
matrix for the three codon positions. This is obtained by 
setting CO = 1 and averaging the three nucleotide substitu- 
tion matrices used to estimate and p^. Then, 

COkCM = KJK, = PaP3~/PsPa~ (7) 

where (jO/<c/vi is the corrected co for the KCM model, Ks the 
number of synonymous substitutions per synonymous site, 
and /<a the number of nonsynonymous substitutions per 
nonsynonymous site. 

Simulated and Empirical Data Sets 
The performances of the KCM models were assessed using 
both simulated and empirical data sets. We created five sim- 
ulated data sets to assess the effects of the amount of double 
and triple substitutions on the estimation of the co parameter 
and the likelihood of the model. The first simulated data set, 

A, is not composed of any double and triple substitutions and 
was created by using the settings of the MO model. The next 
three simulated data sets, called B, C, and D, were obtained by 
using the R rate matrix of simulation A and adding randomly 
drawing double and triple rates from normal distributions 
with varying mean and variances (simulation B: 
A^CO.OOOI, 0.03); simulation C: A/'(0.001, 0.1); simulation 
D: A/'(0.1, 0.1)). This lead to approximately 11%, 30%, and 
39% of double and triple substitutions for these three simu- 
lation schemes, respectively. Moreover, we complemented 
our data sets by using an empirical rate matrix derived by 
Kosiol et al. (2007), which is based on empirical data from the 
Pandit database. This data set that we refer to as ECM is 
composed of approximately 25% of double and triple 
substitutions. 

The expected values of the co parameter for the data sets A, 

B, C, and D were initially set to 1. This was not possible for the 
ECM schemes, which has an unknown co value estimated by 
Kosiol et al. (2007) to be close to 0.3. We nevertheless created 
four other rate matrices by multiplying every substitution rate 
leading to a nonsynonymous change by 0.1, 0.5, 2, and 10. 
Given these rate matrices R (one for each factor), we ran- 
domly created 50 random trees with the R package ape (func- 
tion rtree with parameters br = rlnorm, mean= 0 .41, 
SD = 0.34; Paradis et al. 2004), each composed of 15 se- 
quences and simulated an alignment composed of 150 
codons. The mean and variance of the total branch length 
estimated under the MO model for the five simulated data 
sets describe the sequence divergence considered by our anal- 
ysis (see supplementary table S2, Supplementary Material 
online). We therefore simulated 5 simulation schemes, each 
with 5 different co values and replicated this 50 times for a 
total of 1,250 simulated data sets. For data set A, we used the 
original evolver software (Yang 2007), whereas we imple- 
mented a modified version of evolver package that takes as 
an input a 64 x 64 R matrix. We checked the effects of the 
small alignment length by repeating the simulations with 
3,500 codons for the simulations A, B, and ECM for co 
values 0.5 and 2.0. 
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Beside the simulated data sets, we assessed our new model 
on known protein-coding genes that included the following: 
1) P'Glob'm sequences containing 144 codons for 17 verte- 
brate species extracted from Gen Bank; 2) rbcL sequences 
from plants containing 447 codons for 20 species; and 3) 
pepC sequences also from plants containing 437 codons for 
20 species. The two plant data sets were part of larger studies 
(Christin et al. 2007, 2008) and included several hundreds of 
taxa. We randomly selected 20 species from the published 
aligned matrices of these genes to save computational time. 
We additionally evaluated the effects of the different KCM 
models on 100 empirical data sets (supplementary table S3, 
Supplementary Material online) randomly selected from 
Selectome database (Moretti et al. 2013) and compared the 
fit of these models with MO and MEG These data sets are 
representative of typical data sets analyzed by codon models 
because the co values ranged from 0.0054 to 8.66745, whereas 
the number of sequences and length of alignment is typical of 
current data sets (supplementary table SI, Supplementary 
Material online). 

The biological relevance of the Q matrices returned by the 
best KGM variant was assessed using the almost invariant 
sites approach (Kosiol et al. 2004) as implemented in the 
AIS software. This approach attempts to group codons into 
classes that have high probabilities of change within each class 
while having small probability of change between different 
classes. The analyses were performed with either 20 or 7 clas- 
ses of codons representing the 20 amino acids or 7 
physicochemical properties of amino acids. 

All estimations of maximum likelihood under the KGM 
model were done in a modified version of the GodeML soft- 
ware that is available at the address www.unil.ch/phylo/ 
Bioinformatics (last accessed July 11, 2014). 

Supplementary Material 

Supplementary figure S1-S8 and tables S1-S4 are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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